2024-04-22
LASSO (Least Absolute Shrinkage and Selection Operator), introduced by Robert Tibshirani in 1996 [@Tibshirani1996], is also known as L1 regularization. It is a popular technique in statistical modeling and machine learning for estimating relationships between variables and making predictions.
A distinguishing feature of LASSO is that it can shrink some coefficients to exactly zero, effectively performing variable selection by excluding irrelevant predictors from the model. This helps strike a balance between model simplicity and accuracy.
LASSO regression’s versatility across multiple fields illustrates its capability to manage complex datasets effectively, particularly with continuous outcomes.
Zhou et al. [@Zhou2022] highlighted LASSO’s ability to identify key economic predictors that assist in strategic decision-making.
This example underscores its utility in economic analysis, where it helps to isolate factors that directly influence continuous economic outcomes like wages, prices, or economic growth.
Lu et al. and Musoro [@Lu2011; @Musoro2014] used LASSO regression to develop models based on gene expression data, advancing our understanding of genetic influences on continuous traits and diseases. Their work illustrates how LASSO can handle vast amounts of biological data to pinpoint critical genetic pathways.
McEligot et al. [@McEligot2020] employed logistic LASSO to explore how dietary factors, which vary continuously, affect the risk of developing breast cancer. Their findings highlight LASSO’s strength in dealing with complex, high-dimensional datasets in the health sciences.
LASSO regression is highly valued in fields ranging from healthcare to finance due to its ability to simplify complex models without sacrificing accuracy. This method’s key strengths include:
- Feature Selection: LASSO can set some coefficients exactly to zero, effectively choosing the most relevant variables from many possibilities. This automatic feature selection helps focus the model on the truly impactful factors. [@Park2008]
- Model Interpretability: By eliminating irrelevant variables, LASSO makes the resulting models easier to understand and communicate, enhancing their practical use. [@Belloni2013]
- Mitigation of Multicollinearity: LASSO addresses issues that arise when predictor variables are highly correlated. It tends to select one variable from a group of closely related variables, which simplifies the model and avoids redundancy. [@Efron2004]
LASSO enhances linear regression by adding a penalty on the size of the coefficients, aiding in feature selection and improving model interpretability.
LASSO’s objective function:
\[ \min_{\beta} \left\{ \frac{1}{2n} \sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\} \]
- Goal: Minimize the residual sum of squares (RSS) together with a penalty on the absolute values of the coefficients.
- Parameter λ: Balances goodness of fit against model complexity, guarding against overfitting.
1. Beta Coefficients (\(\beta\))
- These are the parameters of the model, where \(\beta_0\) is the intercept and \(\beta_j\) are the coefficients for the predictors.
2. Observed Values (\(y_i\))
- These are the responses observed for each observation in the dataset.
3. Predictor Values (\(x_{ij}\))
- These are the values of the predictors for each observation.
4. Residual Sum of Squares (RSS)
- Measures the discrepancies between observed values and predictions, normalized by \(\frac{1}{2n}\) for computational convenience.
5. Regularization Parameter (\(\lambda\))
- Controls the trade-off between fitting the data accurately and keeping the model coefficients small.
6. L1 Penalty
- Encourages sparsity by allowing some coefficients to shrink to exactly zero.
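The objective above can be computed directly from these components. A minimal sketch in R (the function name and the inputs `X`, `y`, `beta0`, `beta` are illustrative, not part of the report's analysis):

```r
# Compute the LASSO objective: (1/(2n)) * RSS + lambda * sum(|beta_j|)
# X: n x p predictor matrix; y: response vector of length n
# beta0: intercept; beta: length-p coefficient vector; lambda: penalty weight
lasso_objective <- function(X, y, beta0, beta, lambda) {
  n <- nrow(X)
  residuals <- y - beta0 - X %*% beta
  rss_term <- sum(residuals^2) / (2 * n)   # normalized residual sum of squares
  penalty  <- lambda * sum(abs(beta))      # L1 penalty on the coefficients
  rss_term + penalty
}
```

Evaluating this function at different candidate coefficient vectors makes the trade-off explicit: larger coefficients lower the RSS term but raise the penalty term.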
LASSO regression starts with the standard linear regression model, which assumes a linear relationship between the independent variables (features) and the dependent variable (target).
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p + \epsilon \]
- y is the dependent variable (target).
- β₀, β₁, β₂, …, βₚ are the coefficients (parameters) to be estimated.
- x₁, x₂, …, xₚ are the independent variables (features).
- ε represents the error term.
LASSO regression introduces an additional penalty term based on the absolute values of the coefficients.
The choice of the regularization parameter λ is crucial in LASSO regression:
- At λ = 0, LASSO reduces to ordinary least squares regression, with no coefficient shrinkage.
- Variable Selection: As λ increases, more coefficients are shrunk to exactly zero.
- Optimization: The optimal λ is typically chosen through cross-validation.
- Feature Selection: Reduces the coefficients of non-essential predictors to zero.
- Regularization: Enhances model generalizability, critical for complex datasets.
- Fields of Application: Finance, healthcare, and other areas where accurate prediction is crucial.
- Comparison with MLR: Illustrates LASSO’s advantage in handling high-dimensional data by selectively including only relevant variables.
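This shrinkage behavior can be observed directly with glmnet. A small sketch on simulated data (all names and values here are illustrative; this is not the RetSchool analysis):

```r
library(glmnet)

set.seed(42)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
# Only the first three predictors truly matter
y <- 3 * X[, 1] + 2 * X[, 2] + 1 * X[, 3] + rnorm(n)

fit <- glmnet(X, y, alpha = 1)   # alpha = 1 selects the LASSO penalty

# At a small lambda most coefficients stay nonzero;
# at a large lambda many are shrunk to exactly zero
coef(fit, s = 0.01)
coef(fit, s = 1.0)
```

Printing the two coefficient vectors side by side shows the path from a dense OLS-like solution toward a sparse one as λ grows.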
Our project aims to explore the impact of various factors on wages using the RetSchool dataset, focusing on how education and demographic variables influence earnings in 1976. We have chosen LASSO regression to address our research questions due to its unique capabilities in dealing with complex datasets and its methodological strengths in feature selection and model accuracy.
Research Questions Addressed
What factors most significantly affect wages in 1976?
How do education and demographic variables influence wage disparities?
Can we predict wage outcomes based on these variables effectively using a simplified model?
- Overview of RetSchool Dataset Variables: Understanding the variables in the RetSchool dataset is crucial for analyzing socio-economic and educational influences on wages in 1976.
| Variable | Description | Type | Relevance |
|---|---|---|---|
| wage76 | Wages of individuals in 1976 | Continuous | Primary measure of economic status |
| age76 | Age of individuals | Continuous | Analyzes age impact on wages |
| grade76 | Highest grade completed | Continuous | Indicates educational attainment |
| col4 | College education | Binary | Impact of higher education on wages |
| exp76 | Work experience | Continuous | Examines experience influence on wages |
| momdad14 | Lived with both parents at age 14 | Binary | Family structure’s impact on early life outcomes |
| sinmom14 | Lived with a single mother at age 14 | Binary | Focuses on single-mother household impact |
| daded | Father’s education level | Continuous | Paternal education impact on offspring’s outcomes |
| momed | Mother’s education level | Continuous | Maternal education impact |
| black | Racial identification as black | Binary | Used to analyze racial disparities |
| south76 | Residency in the South | Binary | For regional economic analysis |
| region | Geographic region | Categorical | Regional influences on outcomes |
| smsa76 | Urban residency | Binary | Urban versus rural disparities |
Initial data cleaning included addressing missing values through imputation or removal to refine the dataset for detailed analysis.
- Visualization: The right-skewed distribution of exp76 suggests a young, less experienced workforce.
- Implications: Reflects entry-level workers predominating in 1976, impacting wage levels and economic conditions.
- Visualization: A histogram and density plot show most workers earned lower wages, with a minority earning significantly more.
- Economic Insights: Highlights income disparities and provides insights into the financial stability of the population.
- Analysis Tool: Visualizes relationships between key variables like wage76, grade76, exp76, and age76.
- Findings: Identifies strong predictors of wages and helps understand the economic dynamics of the era.
- Insight: LASSO’s automatic feature selection is pivotal in isolating significant predictors like education level and regional differences, directly impacting wage analysis.
- Benefit: Simplifies the model by focusing only on impactful variables, thus enhancing interpretability, which is critical for formulating effective educational and economic policies.
- Challenge: Overlapping influences of educational attainment and work experience on wages could lead to skewed analytical results.
- Solution: By penalizing the coefficients of correlated predictors, LASSO ensures a more stable and reliable model, addressing multicollinearity without requiring manual intervention.
- Goal: To develop a statistically robust model that stakeholders can easily understand and use.
- Outcome: LASSO’s regularization promotes model simplicity and clarity, providing straightforward insights that are essential for policy-making and strategic educational planning.
- Technique: Incorporates k-fold cross-validation within the LASSO framework to fine-tune the regularization parameter, optimizing model accuracy.
- Advantage: Enhances predictive reliability, crucial for accurately forecasting wage trends based on educational variables, thereby preventing model overfitting.
- Analysis: Compared to traditional regression methods, LASSO effectively manages large datasets with many predictors.
- Result: Demonstrates superior capacity for feature selection and multicollinearity management, making it indispensable for in-depth wage analysis in educational data.
- Variable wage76: Identified as a continuous variable, which benefits significantly from LASSO’s ability to handle continuous data without categorization.
- Importance: Ensures that the nuances and variations in wage data are accurately modeled, providing a deeper understanding of the economic factors at play.
Proper data preparation is critical to ensure the robustness of the statistical analysis:
Handling Missing Data: Key variables with missing data, such as educational background and work experience, were imputed using the median of available data to minimize the impact of outliers.
Removing Incomplete Records: After imputation, records that still contained missing values were removed to maintain the integrity and accuracy of the model analysis.
Visual checks and plots were used to compare the distribution of variables before and after cleaning. These visualizations help confirm that the data cleaning process preserved the underlying structure of the data while improving the quality for analysis.
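The cleaning steps above can be sketched as follows, assuming the raw data frame is named `df` and the cleaned result `df_clean` (the exact set of imputed columns is illustrative, following the variable table):

```r
# Median imputation for continuous variables with missing values,
# followed by removal of any remaining incomplete records
df_clean <- df
for (col in c("grade76", "exp76", "daded", "momed")) {
  med <- median(df_clean[[col]], na.rm = TRUE)           # median of observed values
  df_clean[[col]][is.na(df_clean[[col]])] <- med         # fill in the gaps
}
df_clean <- na.omit(df_clean)   # drop rows that still contain NAs
```

Using the median rather than the mean keeps the imputed values robust to the right-skewed distributions noted earlier.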
- Target Variable: The primary variable of interest, wage76, represents the wages of individuals in 1976 and serves as the dependent variable in our LASSO model.
- Predictor Variables: Variables selected based on their theoretical relevance to wage determination included education level (grade76, col4), work experience (exp76), and demographic factors (e.g., age, race, geographic location).
With the data now clean and the variables of interest identified, visualizing them can provide deeper insights into their distributions and relationships within the dataset. This helps in understanding the dynamics and potential influences on wages in 1976.
Effective feature scaling is essential before fitting the LASSO model to ensure each variable contributes equally to the analysis. This prevents any feature from disproportionately influencing the outcome due to scale variance.
- Standardization Process: All features are normalized to have zero mean and unit variance. This step is crucial for models that apply a penalty on the size of coefficients, such as LASSO.
```r
library(dplyr)
library(caret)
library(glmnet)

# Select only the numeric features and exclude the target variable 'wage76'
numeric_features <- select(df_clean, where(is.numeric), -wage76)

# Convert the selected features into a matrix, as required by glmnet
features <- data.matrix(numeric_features)

# Center to zero mean and scale to unit variance
preProcValues <- preProcess(features, method = c("center", "scale"))
features_scaled <- predict(preProcValues, features)
```

Selecting the optimal regularization parameter, λ, is crucial for balancing the complexity and accuracy of the LASSO model.
Cross-Validation Technique
Cross-validation is used to determine the λ that minimizes prediction error. By validating the model across multiple data subsets, this technique helps ensure the model performs well on unseen data.
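With glmnet, this tuning is a single call. A sketch, assuming the scaled predictor matrix `features_scaled` from the previous step and a response vector `wage` (the latter name is an assumption):

```r
library(glmnet)

# k-fold cross-validation (default k = 10) over a grid of lambda values
cv_fit <- cv.glmnet(features_scaled, wage, alpha = 1, nfolds = 10)

cv_fit$lambda.min   # lambda that minimizes mean cross-validated error
cv_fit$lambda.1se   # largest lambda within one standard error of the minimum
plot(cv_fit)        # the cross-validation curve
```

`lambda.1se` is a common, more conservative alternative to `lambda.min`: it trades a small amount of cross-validated error for a sparser model.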
Figure 6: Cross-Validation Curve
Analyzing the coefficients after fitting the model with the optimal λ reveals which variables significantly influence the dependent variable.
Significance of Coefficients: Coefficients that remain significant (not shrunk to zero) are key predictors of wages.
Interpretation of Results: The size and direction of these coefficients provide insights into how each predictor affects wage levels.
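A sketch of inspecting the surviving coefficients, assuming a cross-validated fit `cv_fit` returned by `cv.glmnet`:

```r
# Coefficients at the lambda chosen by cross-validation
coefs <- coef(cv_fit, s = "lambda.min")

# Keep only the predictors not shrunk to zero
nonzero <- coefs[coefs[, 1] != 0, , drop = FALSE]
print(nonzero)
```

The rows that survive this filter are the key predictors; their signs and magnitudes give the direction and strength of each effect on wages.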
The LASSO model has highlighted key factors that influence wages, efficiently pinpointing the most impactful variables:
- Baseline Wages: The intercept sets the starting point for wage predictions, adjusted for average levels of predictors.
- Educational Attainment (grade76): Higher education levels are strongly associated with increased wages, confirming the significant return on investment in education.
- Racial Disparity (black): There is a noticeable wage gap affecting black individuals, indicating persistent racial inequalities in earnings.
- Geographic Influence (south76): Residing in the South is linked to lower wages, reflecting regional economic differences.
- Urban Premium (smsa76) and Long-term Urban Advantage (smsa66): Both highlight the wage benefits of living in urban areas, both currently and historically.
- Family Stability (momdad14): Growing up with both parents is correlated with higher wages, suggesting the economic benefits of a stable family during childhood.
- Parental Education (momed): Higher maternal education levels positively affect wages, underscoring the influence of parental education.
- Age and Experience (age76): Older age, often accompanied by more experience, typically leads to higher wages, reflecting the value of longevity in the workforce.
- Higher Education (col4): Possessing a college degree shows a positive but modest correlation with higher wages.
Variables with Minimal Impact
Several variables demonstrated minimal to no influence on wages, exemplifying LASSO’s ability to streamline the model by eliminating non-impactful predictors. These include exp76, region, sinmom14, nodaded, nomomed, daded, and famed. This reduction in variables enhances the interpretability of our model without compromising the accuracy of our predictions.
Our analysis with the LASSO model has effectively highlighted the most significant factors influencing wages.
However, to deepen our understanding of these results, we also employed Multiple Linear Regression (MLR) as a comparative tool.
This step allows us to observe the differences in how each model handles the complexity of the dataset and the continuous nature of the wage variable.
- LASSO Regression: Applies a penalty to reduce the influence of less significant predictors, enhancing model simplicity and accuracy.
- MLR: Provides a baseline by including all predictors without regularization, illustrating potential overfitting issues.
Here we analyze both LASSO and Multiple Linear Regression (MLR) to demonstrate how each model manages the complex and continuous nature of the wage variable:
Both models are applied to the same cleaned dataset to ensure a fair comparison:
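A sketch of fitting both models on the same cleaned data, assuming `df_clean`, the scaled matrix `features_scaled`, and a response vector `wage` from the earlier steps:

```r
library(glmnet)

# Multiple Linear Regression: every predictor enters, no regularization
mlr_fit <- lm(wage76 ~ ., data = df_clean)
summary(mlr_fit)$coefficients      # all predictors keep an estimate

# LASSO at the cross-validated lambda: non-essential predictors drop out
cv_fit <- cv.glmnet(features_scaled, wage, alpha = 1)
coef(cv_fit, s = "lambda.min")     # sparse coefficient vector
```

Comparing the two outputs makes the contrast concrete: MLR assigns every predictor a coefficient (significant or not), while LASSO reports exact zeros for the variables it excludes.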
The analysis of both LASSO and MLR models sheds light on the robustness of each predictor’s impact on wages. Here, we provide a side-by-side view of the coefficients, illustrating how each model values the same predictors:
- Consistency and Differences: Where LASSO might zero out a predictor, MLR may still attribute it with a significant coefficient. Such differences are key in understanding the potential for overfitting in MLR versus the more conservative and potentially more reliable predictions from LASSO.
- Focus on Continuous Outcomes: Both models emphasize the importance of continuous predictors like age76 and grade76, but LASSO does so while maintaining model parsimony, avoiding the pitfalls of including too many variables as MLR might.
A visual representation of coefficient differences effectively illustrates the impact of regularization:
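One way to produce such a visual, assuming the two sets of estimates have been aligned into a data frame `comparison` with columns `variable`, `mlr`, and `lasso` (all names here are hypothetical):

```r
library(ggplot2)
library(tidyr)

# Reshape to long format: one row per (variable, model) pair
comparison_long <- pivot_longer(comparison, cols = c(mlr, lasso),
                                names_to = "model", values_to = "coefficient")

# Dodged horizontal bars: LASSO's zeroed-out predictors show as missing bars
ggplot(comparison_long, aes(x = variable, y = coefficient, fill = model)) +
  geom_col(position = "dodge") +
  coord_flip() +
  labs(title = "MLR vs. LASSO coefficient estimates",
       x = "Predictor", y = "Coefficient")
```

The predictors where one bar vanishes are exactly those the LASSO penalty removed, which is the visual signature of regularization.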
Exploring the specific implications of our findings within the context of the Return to School dataset:
-What We Analyzed: We focused on understanding how various educational, demographic, and work experience factors influence wage disparities in 1976.
-Why It Matters: This analysis is crucial for identifying key areas where educational and economic policies can be targeted to reduce wage inequality.
-How We Did It: Using LASSO and MLR, we were able to discern which variables significantly impact wages, with LASSO providing a more streamlined model that avoids overfitting and highlights the most impactful factors.
This analysis not only enhances academic understanding but also provides concrete data to inform policy makers:
- Policy Recommendations: Insights from the study can guide the development of policies aimed at addressing the root causes of wage disparities identified through the model.
- Educational Impact: By understanding which educational factors influence earnings, institutions can tailor programs to enhance the economic outcomes of their students.
Our comprehensive analysis using LASSO regression has identified pivotal factors that influenced wages in 1976, with a focus on the impact of educational attainment and age.
This study opens the door for further research into additional socioeconomic factors that could affect wage disparities. Future studies could explore the impact of technological advances, economic policies, and other demographic changes on wage trends. Such research would help extend our understanding of the dynamics between education and wages over longer periods and under varying economic conditions.
Thank you, Dr. Cohen, for your guidance and support throughout the semester. We appreciate everyone's attention and interest in our findings. We are now open to any questions or discussions, and your feedback and suggestions for further research areas are highly welcome.